Abstract

The lasso has become so ubiquitous in statistics that it can be considered a paradigm for inference and prediction. During the last fifteen years, the lasso procedure has been the target of a substantial amount of theoretical and applied research. Correspondingly, many results are known about its behavior for a fixed or optimally chosen smoothing parameter. These results, such as sparsistency and risk consistency, are of some comfort when using the lasso in applications. However, much less is known about its behavior when the smoothing parameter is chosen in a data-dependent way, which is almost always the case in practice. To this end, we give the first definitive answer about the risk consistency of the lasso when the smoothing parameter is chosen empirically using the same data that are used to fit the lasso estimator. We show that under restrictions on the design matrix, the lasso estimator is still risk consistent when the smoothing parameter is chosen via cross-validation.



Keywords: Stochastic equicontinuity, uniform convergence, leave-one-out. 

1 Introduction 



Since its introduction in the statistical [1] and signal processing [2] communities, the lasso has become a fixture as both a data analysis tool (cf. [3, 4, 5]) and as an object for deep theoretical investigations (cf. [6, 7, 8]). To fix ideas, suppose that the observational model is of the form

$$Y = X\theta + \sigma W, \qquad (1)$$

where $Y = (Y_1, \ldots, Y_n)$ is the vector of responses and $X = (X_1, \ldots, X_n)$ is the design matrix$^{1}$.

Under (1), the lasso estimator, $\hat\theta(\lambda)$, is defined to be the minimizer of the following functional:

$$\hat\theta(\lambda) := \operatorname*{argmin}_{\theta} \|Y - X\theta\|_2^2 + \lambda\|\theta\|_1. \qquad (2)$$

Here, $\lambda \geq 0$ is a tuning parameter controlling the trade-off between fidelity to the data (small $\lambda$) and sparsity (large $\lambda$).



1 Assuming the linear model is not necessary as we can interpret (1) as the linear oracle projection of the 
regression function onto the column space of X. The results in this paper hold under either interpretation. 
It is well known that under conditions on the matrix $X$, noise vector $W$, and the parameter $\theta$, the optimal choice of $\lambda$ leads to risk consistency [9]. However, arguably the most crucial aspect of any procedure's performance is the selection of the tuning parameters. Theoretical results, such as those in [8], depend critically on the choice of tuning parameters and typically specify rates of decay so that if $\lambda = \lambda_n$ goes to zero at the correct rate then $\hat\theta(\lambda_n)$ will be risk consistent. For the regularized problem in (2), taking $\lambda_n = o((\log(n)/n)^{1/2})$ gives consistency under very general conditions. However, this type of theoretical guidance says nothing about the properties of the lasso when the tuning parameter is chosen in a data-dependent way.

The standard technique for selecting $\lambda$ in the lasso problem is to choose $\lambda = \hat\lambda_n$ such that $\hat\lambda_n$ minimizes the leave-one-out cross-validation (which we refer to in this paper simply as cross-validation) estimator of the risk. Finally, $\hat\theta(\hat\lambda_n)$ is returned as the estimate of $\theta$ (cf. [10, 11, 12]).

In the analogous situation of estimation using splines, the tuning parameter is usually selected by generalized cross-validation. In this case, the asymptotic behavior with the tuning parameter chosen in a data-dependent way is well known [13].

However, the vast literature on the lasso is strangely silent on the theoretical behavior of the cross-validation estimator. The heuristic understanding of the performance of $\hat\theta(\hat\lambda_n)$ is perhaps best encapsulated in Peter Bühlmann's statement:

    Regarding the choice of the regularization parameter, we typically use $\hat\lambda_n$ from
    cross-validation. 'Luckily', empirical and some theoretical indications support
    [good performance]... [12, Bühlmann's comments].

At the same time, various authors have discovered issues with some procedures for model 
selection in lasso. For example, [14] show that using prediction accuracy to choose the tuning 
parameter fails to recover the sparsity pattern consistently in an orthogonal design setting. 
Furthermore, [15] show that sparsity inducing algorithms like lasso are not algorithmically 
stable. In other words, leave-one-out risk estimates are not uniformly close to each other. 
This seems to suggest that cross-validation may not be risk consistent. However, stability is 
a sufficient, but not a necessary, condition for risk consistency. Our result partially resolves 
this antagonism between stability and risk consistency for lasso with the tuning parameter 
selected via cross-validation. 

In this paper we provide the first definitive answer about the risk consistency of lasso with an empirically selected tuning parameter by showing, under restrictive conditions on the design matrix, that the lasso estimator is risk consistent when $\lambda$ is chosen via cross-validation. In section 2 we introduce our notation, model, and state our main theorem.
In section 3 we state some results necessary for our proof methods, while in section 4 we 
provide the proof. Lastly, in section 5 we mention some implications of our main theorem 
and conclude. 

2 Notation, assumptions, and main results 

Suppose we get observations under the following ANOVA-like observational model

$$Y_{ij} = \theta_j + \sigma W_{ij}. \qquad (3)$$

Here $i = 1, \ldots, n$, $j = 1, \ldots, p$, where the $W_{ij}$ are i.i.d. $N(0, 1)$. We assume $\theta \in \mathbb{R}^p$ but put no other restrictions on the parameter space.

Define $Z_{nj} = \frac{1}{n}\sum_{i=1}^{n} Y_{ij}$; then

$$Z_{nj} = \theta_j + \frac{\sigma}{\sqrt{n}} W_{nj}. \qquad (4)$$



To motivate this model, consider a design matrix $X$ such that $X^\top X = n I_p$, where $I_p$ is the identity matrix. Then the lasso estimator has the particularly tractable form

$$\hat\theta_j(\lambda) = \operatorname{sgn}(\hat\theta_j(0))\,(|\hat\theta_j(0)| - \lambda)_+,$$

where $\hat\theta(0) = \hat\theta(\lambda)|_{\lambda=0}$ is the ordinary least squares estimator.

As $\hat\theta(0) \sim N(\theta, \sigma^2 (X^\top X)^{-1})$, this corresponds to the following low-noise, normal means problem

$$\hat\theta(0) = \theta + \frac{\sigma}{\sqrt{n}} W, \qquad (5)$$

where $W$ is a $p$-dimensional standard normal.

By sufficiency, all three models in (3), (4), and (5) are equivalent. Therefore, in partic- 
ular, results involving risk comparisons made using the model in (3) hold for (5) as well. 
While the orthogonal design condition may seem strong, it is in fact not much different 
from standard practice in the literature. We return to this point in section 5 below. 
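The closed form above can be checked numerically. The following is a minimal sketch, not the paper's code: the function names are hypothetical, and it assumes the per-coordinate objective is scaled as $\frac12 (z - t)^2 + \lambda |t|$, so that the threshold equals $\lambda$ exactly. Under that assumption, a brute-force grid search recovers the soft-thresholded value.

```python
def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    magnitude = abs(z) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def objective(t, z, lam):
    """Per-coordinate penalized loss 0.5 * (z - t)^2 + lam * |t| (assumed scaling)."""
    return 0.5 * (z - t) ** 2 + lam * abs(t)


def grid_minimizer(z, lam, lo=-5.0, hi=5.0, steps=10001):
    """Brute-force minimizer of the objective over an evenly spaced grid."""
    best_t, best_val = lo, objective(lo, z, lam)
    for k in range(1, steps):
        t = lo + (hi - lo) * k / (steps - 1)
        val = objective(t, z, lam)
        if val < best_val:
            best_t, best_val = t, val
    return best_t


if __name__ == "__main__":
    for z, lam in [(2.3, 0.7), (-1.9, 0.4), (0.3, 1.0)]:
        print(z, lam, soft_threshold(z, lam), grid_minimizer(z, lam))
```

The grid minimizer agrees with `soft_threshold` up to the grid spacing, which illustrates why the orthogonal design reduces the lasso to a coordinate-wise problem.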

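We define the predictive risk and the leave-one-out cross-validation estimator of risk to be

$$R_n(\lambda) := E\|\hat\theta(\lambda) - \theta\|_2^2 + \sigma^2 \quad\text{and}\quad \hat R_n(\lambda) := \sum_{j=1}^{p} \frac{1}{n}\sum_{i=1}^{n} \big(\hat\theta_j^{(i)}(\lambda) - Y_{ij}\big)^2,$$

respectively. Here we are using $\hat\theta_j^{(i)}(\lambda)$ to indicate the lasso estimator $\hat\theta_j(\lambda)$ computed using all but the $i$th observation.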
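As an illustration of the cross-validation estimator, here is a minimal sketch for the normal means model (3). It assumes that the leave-one-out estimator $\hat\theta_j^{(i)}(\lambda)$ is the soft-thresholded mean of the remaining $n - 1$ observations in column $j$; the function names and the list-of-rows data layout are hypothetical choices made for this example.

```python
def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    magnitude = abs(z) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def loo_cv_risk(Y, lam):
    """Leave-one-out CV risk estimate for the normal means model.

    Y is a list of n rows, each a list of p observations Y[i][j].
    Assumes n >= 2 so that every leave-one-out mean is defined.
    """
    n, p = len(Y), len(Y[0])
    col_sums = [sum(Y[i][j] for i in range(n)) for j in range(p)]
    total = 0.0
    for j in range(p):
        for i in range(n):
            z_loo = (col_sums[j] - Y[i][j]) / (n - 1)  # column mean without row i
            theta_loo = soft_threshold(z_loo, lam)
            total += (theta_loo - Y[i][j]) ** 2
    return total / n


if __name__ == "__main__":
    Y = [[1.0], [3.0]]
    print(loo_cv_risk(Y, 0.0))   # squared errors (3-1)^2 and (1-3)^2, averaged
    print(loo_cv_risk(Y, 10.0))  # estimator is 0, so errors are 1 and 9, averaged
```

With two observations and $\lambda = 0$, each held-out point is predicted by the other one, which makes the arithmetic easy to follow by hand.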

Lastly, let $\Lambda$ be a large, compact subset of $[0, \infty)$, the specifics of which are unimportant. In practical situations, any $\lambda \in [\max_j |\hat\theta_j(0)|, \infty)$ will result in the same solution, namely $\hat\theta_j(\lambda) = 0$ for all $j$, so any large finite upper bound is sufficient. Then define

$$\hat\lambda_n := \operatorname*{argmin}_{\lambda\in\Lambda} \hat R_n(\lambda) \quad\text{and}\quad \lambda_n := \operatorname*{argmin}_{\lambda} R_n(\lambda)$$

to be the minimizers of the cross-validation estimator $\hat R_n$ and of the risk $R_n$, respectively. Note that $\lambda_n$ must be going to zero as $n \to \infty$. Hence, for some $N$, $n > N$ implies $\lambda_n \in \Lambda$. Therefore, without loss of generality, we assume that $\lambda_n \in \Lambda$ for all $n$.
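In practice the minimization over $\Lambda$ is usually approximated on a finite grid. The sketch below is one hedged way to do that, under the same assumptions as before (soft-thresholded leave-one-out means). The helper names `default_grid` and `select_lambda` are hypothetical, and taking the upper end of the grid to be $\max_{i,j}|Y_{ij}|$ is a convenience choice for this example: it bounds every leave-one-out mean in absolute value, so the estimator is identically zero at that point.

```python
def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    magnitude = abs(z) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def loo_cv_risk(Y, lam):
    """Leave-one-out CV risk estimate; Y is a list of n >= 2 rows of length p."""
    n, p = len(Y), len(Y[0])
    col_sums = [sum(Y[i][j] for i in range(n)) for j in range(p)]
    total = 0.0
    for j in range(p):
        for i in range(n):
            z_loo = (col_sums[j] - Y[i][j]) / (n - 1)
            total += (soft_threshold(z_loo, lam) - Y[i][j]) ** 2
    return total / n


def default_grid(Y, size=50):
    """Evenly spaced grid on [0, max |Y_ij|]; assumes size >= 2."""
    upper = max(abs(y) for row in Y for y in row)
    return [upper * k / (size - 1) for k in range(size)]


def select_lambda(Y, grid):
    """Return the grid point minimizing the CV risk (first one on ties)."""
    return min(grid, key=lambda lam: loo_cv_risk(Y, lam))


if __name__ == "__main__":
    Y = [[0.1, 2.0], [-0.2, 2.2], [0.15, 1.9], [-0.05, 2.1]]
    grid = default_grid(Y)
    lam_hat = select_lambda(Y, grid)
    print(lam_hat, loo_cv_risk(Y, lam_hat))
```

A finer grid trades computation for a closer approximation to the minimizer over the compact set; nothing in this sketch depends on the particular grid size.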

We spend the balance of this paper proving and discussing the following: 

Theorem 2.1 (Main Theorem). Under model (3) and for any $\theta \in \mathbb{R}^p$,

$$R_n(\hat\lambda_n) - R_n(\lambda_n) \to 0. \qquad (6)$$

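To prove this theorem, we show that $\sup_{\lambda\in\Lambda} |\hat R_n(\lambda) - R_n(\lambda)| \to 0$ in probability. Then (6) follows as

$$
\begin{aligned}
R_n(\hat\lambda_n) - R_n(\lambda_n)
&= \big(R_n(\hat\lambda_n) - \hat R_n(\hat\lambda_n)\big) + \big(\hat R_n(\hat\lambda_n) - R_n(\lambda_n)\big) \\
&\leq \big(R_n(\hat\lambda_n) - \hat R_n(\hat\lambda_n)\big) + \big(\hat R_n(\lambda_n) - R_n(\lambda_n)\big) \\
&\leq \sup_{\lambda\in\Lambda}\big(R_n(\lambda) - \hat R_n(\lambda)\big) + \sup_{\lambda\in\Lambda}\big(\hat R_n(\lambda) - R_n(\lambda)\big) \\
&= o_p(1) + o_p(1) = o_p(1).
\end{aligned}
$$

The second line uses $\hat R_n(\hat\lambda_n) \leq \hat R_n(\lambda_n)$, which holds because $\hat\lambda_n$ minimizes $\hat R_n$ over $\Lambda$ and $\lambda_n \in \Lambda$. Hence, it is sufficient to show that $\sup_{\lambda\in\Lambda} |\hat R_n(\lambda) - R_n(\lambda)| = o_p(1)$. In fact, the term $R_n(\hat\lambda_n) - R_n(\lambda_n)$ is non-stochastic (the expectation in the risk integrates out the randomness in the data) and therefore convergence in probability implies sequential convergence and hence $o_p(1) = o(1)$.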



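The quantity $|\hat R_n(\lambda) - R_n(\lambda)|$ can be bounded as

$$|\hat R_n(\lambda) - R_n(\lambda)| \leq (a) + 2(b) + (c), \qquad (7)$$

where, suppressing the sum over the coordinates $j$,

$$(a) = \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - E\hat\theta_j(\lambda)^2\right|, \quad (b) = \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - E\hat\theta_j(\lambda)\,\theta_j\right|, \quad (c) = \left|\frac{1}{n}\sum_{i=1}^{n} Y_{ij}^2 - E Y_{ij}^2\right|.$$

We address (a) and (b) in lexicographic order in section 4. The last part, (c), is independent of $\lambda$, and is $o_p(1)$ by the weak law of large numbers.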



3 Preliminary material 

Our proof of Theorem 2.1 uses two results relating uniform convergence in probability and 
stochastic equicontinuity [16]. We state these preliminary results here before giving our 
proof. 

Suppose that we are interested in estimating some functional of a parameter $\beta$, $\bar Q_n(\beta)$, using $\hat Q_n(\beta)$, where $\beta \in B$.

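Definition 3.1 (Stochastic equicontinuity). If for every $\epsilon, \eta > 0$ there exists a random variable $\Delta_n(\epsilon, \eta)$ and constant $n_0(\epsilon, \eta)$ such that for $n \geq n_0(\epsilon, \eta)$, $P(|\Delta_n(\epsilon, \eta)| > \epsilon) < \eta$ and for each $\beta \in B$ there is an open set $N(\beta, \epsilon, \eta)$ containing $\beta$ such that for $n \geq n_0(\epsilon, \eta)$,

$$\sup_{\beta' \in N(\beta, \epsilon, \eta)} \big|\hat Q_n(\beta') - \hat Q_n(\beta)\big| \leq \Delta_n(\epsilon, \eta),$$

then we call $\{\hat Q_n\}$ stochastically equicontinuous over $B$.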

Theorem 3.2 (Theorem 2.1 in [16]). If $B$ is compact, $|\hat Q_n(\beta) - \bar Q_n(\beta)| = o_p(1)$ for each $\beta \in B$, and $\{\hat Q_n\}$ is stochastically equicontinuous over $B$, then $\sup_{\beta\in B} |\hat Q_n(\beta) - \bar Q_n(\beta)| = o_p(1)$.



Corollary 3.3 (Corollary 2.2 in [16]). If $B$ is a compact metric space with metric $d$, $|\hat Q_n(\beta) - \bar Q_n(\beta)| = o_p(1)$ for all $\beta \in B$, and there exist $B_n$ and $h$ such that $B_n = O_p(1)$ and for all $\beta', \beta \in B$, $|\hat Q_n(\beta') - \hat Q_n(\beta)| \leq B_n h(d(\beta', \beta))$, then $\sup_{\beta\in B} |\hat Q_n(\beta) - \bar Q_n(\beta)| = o_p(1)$.



4 Proofs 

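Theorem 4.1 (Part (a)).

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - E\hat\theta_j(\lambda)^2\right| = o_p(1).$$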



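The proof of Theorem 4.1 follows by Lemma 4.2 and Lemma 4.3, which we state here. Each handles one term of the decomposition

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - E\hat\theta_j(\lambda)^2\right| \leq \sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - \hat\theta_j(\lambda)^2\right| + \sup_{\lambda\in\Lambda} \left|\hat\theta_j(\lambda)^2 - E\hat\theta_j(\lambda)^2\right|.$$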
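Lemma 4.2.

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - \hat\theta_j(\lambda)^2\right| = o_p(1).$$

Proof of Lemma 4.2.

$$
\begin{aligned}
\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda)^2 - \hat\theta_j(\lambda)^2\right|
&\leq \sup_{\lambda\in\Lambda} \frac{1}{n}\sum_{i=1}^{n} \left|\hat\theta_j^{(i)}(\lambda)^2 - \hat\theta_j(\lambda)^2\right| \\
&\leq \frac{1}{n}\sum_{i=1}^{n} \sup_{\lambda\in\Lambda} \left|(|Z_{nj}^{(i)}| - \lambda)_+^2 - (|Z_{nj}| - \lambda)_+^2\right|.
\end{aligned}
$$

For any $i$, the supremum is achieved at $\lambda = 0$.$^{2}$ Therefore,

$$\frac{1}{n}\sum_{i=1}^{n} \sup_{\lambda\in\Lambda} \left|(|Z_{nj}^{(i)}| - \lambda)_+^2 - (|Z_{nj}| - \lambda)_+^2\right| = \frac{1}{n}\sum_{i=1}^{n} \left|(Z_{nj}^{(i)})^2 - (Z_{nj})^2\right| = o_p(1),$$

since both $(Z_{nj}^{(i)})^2$ and $(Z_{nj})^2$ converge to $\theta_j^2$. □

Lemma 4.3.

$$\sup_{\lambda\in\Lambda} \left|\hat\theta_j(\lambda)^2 - E\hat\theta_j(\lambda)^2\right| = o_p(1).$$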



Proof of Lemma 4.3. To show this claim, we invoke Corollary 3.3. We need to show both that there exist $B_n, h$ such that $B_n = O_p(1)$, $h : [0, \infty) \to [0, \infty)$ with $h(0) = 0$, and for all $\lambda_1, \lambda_2 \in \Lambda$, $|(|Z_{nj}| - \lambda_1)_+^2 - (|Z_{nj}| - \lambda_2)_+^2| \leq B_n h(|\lambda_1 - \lambda_2|)$. We must also show that $|\hat\theta_j(\lambda)^2 - E\hat\theta_j(\lambda)^2| \xrightarrow{P} 0$ pointwise over $\lambda \in \Lambda$.

Let $D := |(|Z_{nj}| - \lambda_1)_+^2 - (|Z_{nj}| - \lambda_2)_+^2|$. Suppose without loss of generality, $\lambda_1 > \lambda_2$. Then, there are 3 cases.

(i) $|Z_{nj}| > \lambda_1$. Then $D \leq 2|Z_{nj}||\lambda_1 - \lambda_2| \leq 2Z_{nj}^2|\lambda_1 - \lambda_2| = O_p(1)\,h(|\lambda_1 - \lambda_2|)$, where $B_n = Z_{nj}^2 = O_p(1)$ and $h(x) = 2x$.

(ii) $|Z_{nj}| < \lambda_2$. Then $D = 0$.

(iii) $\lambda_2 < |Z_{nj}| < \lambda_1$. Then $D = (|Z_{nj}| - \lambda_2)_+^2 \leq Z_{nj}^2$, so (iii) is bounded by (i).

As pointwise convergence follows immediately from the weak law of large numbers, Corollary 3.3 is satisfied. □
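Two elementary facts about the squared soft-threshold carry the arguments of Lemmas 4.2 and 4.3: the gap $|(|a| - \lambda)_+^2 - (|b| - \lambda)_+^2|$ is largest at $\lambda = 0$, and $\lambda \mapsto (|a| - \lambda)_+^2$ is Lipschitz with constant $2|a|$ (the first inequality in case (i)). The following sketch spot-checks both on random inputs. It is a sanity check under arbitrary sampling ranges chosen for this example, not a proof, and the function names are hypothetical.

```python
import random


def sq_soft(a, lam):
    """Squared soft-threshold magnitude: (max(|a| - lam, 0))^2."""
    return max(abs(a) - lam, 0.0) ** 2


def gap(a, b, lam):
    """Absolute difference of the squared soft-thresholds of a and b."""
    return abs(sq_soft(a, lam) - sq_soft(b, lam))


def check_facts(trials=2000, seed=0):
    """Return True if both facts hold on every random draw."""
    rng = random.Random(seed)
    tol = 1e-12  # slack for floating-point rounding
    for _ in range(trials):
        a, b = rng.uniform(-5, 5), rng.uniform(-5, 5)
        lam1, lam2 = rng.uniform(0, 6), rng.uniform(0, 6)
        # Fact 1: the gap at any lam >= 0 never exceeds the gap at lam = 0.
        if gap(a, b, lam1) > gap(a, b, 0.0) + tol:
            return False
        # Fact 2: Lipschitz continuity in lam with constant 2|a|.
        d = abs(sq_soft(a, lam1) - sq_soft(a, lam2))
        if d > 2 * abs(a) * abs(lam1 - lam2) + tol:
            return False
    return True


if __name__ == "__main__":
    print(check_facts())
```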

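Theorem 4.4 (Part (b)).

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - E\hat\theta_j(\lambda)\,\theta_j\right| = o_p(1).$$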
2 As a function of $\lambda$, $(|Z_{nj}| - \lambda)_+^2$ and $(|Z_{nj}^{(i)}| - \lambda)_+^2$ both look like the left half of a quadratic function with the vertices at $|Z_{nj}|$ and $|Z_{nj}^{(i)}|$ respectively.






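To prove this theorem, there are two relevant cases: when $\theta_j = 0$ and $\theta_j \neq 0$. When $\theta_j = 0$, we have

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij}\right| \leq \frac{1}{n}\sum_{i=1}^{n} |Y_{ij}| \sup_{\lambda\in\Lambda} \big|\hat\theta_j^{(i)}(\lambda)\big|.$$

But the right-hand side is $o_p(1)$ by the weak law of large numbers and Slutsky's theorem.

If $\theta_j \neq 0$, however, convergence analysis is trickier as we are dealing with random functions of random variables. Here,

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - E\hat\theta_j(\lambda)\,\theta_j\right| \leq \sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - \hat\theta_j(\lambda) Z_{nj}\right| + \sup_{\lambda\in\Lambda} \left|\hat\theta_j(\lambda) Z_{nj} - E\hat\theta_j(\lambda)\,\theta_j\right|.$$

To deal with the second term, we need the following lemma.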
Lemma 4.5.

$$\sup_{\lambda\in\Lambda} \left|\hat\theta_j(\lambda) Z_{nj} - E\hat\theta_j(\lambda)\,\theta_j\right| = o_p(1). \qquad (8)$$



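Proof of Lemma 4.5. We show that the sequence $(\hat\theta_j(\lambda) Z_{nj})_n$ converges pointwise as a function of $\lambda$ to $E\hat\theta_j(\lambda)\,\theta_j$ and forms a stochastically equicontinuous set. Additionally, we show $E\hat\theta_j(\lambda)\,\theta_j$ forms an equicontinuous set. Then, by Theorem 3.2, (8) follows.

For pointwise convergence, suppose $\phi_n$ is the Gaussian pdf with mean $\theta_j$ and variance $\sigma^2/n$ and define $f(\lambda, z) := \operatorname{sgn}(z)(|z| - \lambda)_+$. Then

$$\left|\hat\theta_j(\lambda) Z_{nj} - E\hat\theta_j(\lambda)\,\theta_j\right| \leq \left|\hat\theta_j(\lambda) Z_{nj} - f(\lambda, \theta_j) Z_{nj}\right| + \left|f(\lambda, \theta_j) Z_{nj} - E\hat\theta_j(\lambda)\,\theta_j\right| = o_p(1)$$

as $|\hat\theta_j(\lambda) - f(\lambda, \theta_j)| = o_p(1)$ and

$$E\hat\theta_j(\lambda) = \int f(\lambda, z)\,\phi_n(z)\,dz \to f(\lambda, \theta_j)$$

due to $\phi_n$ being an approximate identity centered at $\theta_j$ and $f$ continuous in $\lambda$.

To show stochastic equicontinuity, we need for all $\epsilon, \eta > 0$ that there exist $\Delta_n(\epsilon, \eta)$ and a constant $n(\epsilon, \eta)$ such that for all $n \geq n(\epsilon, \eta)$, $P(|\Delta_n| > \epsilon) < \eta$ and there exists an open set $O(\lambda_n, \epsilon, \eta)$ containing $\lambda_n$ such that

$$\sup_{\lambda \in O(\lambda_n, \epsilon, \eta)} \left|\operatorname{sgn}(z)(|z| - \lambda_n)_+ Z_{nj} - \operatorname{sgn}(z)(|z| - \lambda)_+ Z_{nj}\right| \leq \Delta_n(\epsilon, \eta).$$

Let $\epsilon, \eta > 0$ be given. Choose $\Delta_n = |Z_{nj}|/n$; then $\Delta_n \to 0$, which implies that there exists $n(\epsilon, \eta)$ such that for $n \geq n(\epsilon, \eta)$

$$P(|\Delta_n| > \epsilon) < \eta.$$

Also, choose $O(\lambda_n, \epsilon, \eta) = [\lambda_n - 1/n, \lambda_n + 1/n]$. Then

$$\sup_{\lambda \in O(\lambda_n, \epsilon, \eta)} \left|\operatorname{sgn}(Z_{nj})(|Z_{nj}| - \lambda_n)_+ Z_{nj} - \operatorname{sgn}(Z_{nj})(|Z_{nj}| - \lambda)_+ Z_{nj}\right| \leq |Z_{nj}| \sup_{\lambda \in O(\lambda_n, \epsilon, \eta)} \left|(|Z_{nj}| - \lambda_n)_+ - (|Z_{nj}| - \lambda)_+\right| \leq \Delta_n,$$

showing stochastic equicontinuity.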
For equicontinuity, fix $\epsilon > 0$. As $f(\cdot, z)$ is a continuous function, there exists $\delta(z) > 0$ such that for $|\lambda_1 - \lambda_2| < \delta(z)$, $|f(\lambda_1, z) - f(\lambda_2, z)| < \epsilon$. Also, due to the piecewise linear structure of $f$, there exists $\delta$ such that the above holds for $\delta = \delta(z)$. Recall that

$$E\hat\theta_j(\lambda) = \int \operatorname{sgn}(z)(|z| - \lambda)_+\,\phi_n(z)\,dz = \int f(\lambda, z)\,\phi_n(z)\,dz.$$

Let $|\lambda_1 - \lambda_2| < \delta$. Then

$$
\begin{aligned}
\left|\theta_j \int f(\lambda_1, z)\,\phi_n(z)\,dz - \theta_j \int f(\lambda_2, z)\,\phi_n(z)\,dz\right|
&\leq |\theta_j| \int |f(\lambda_1, z) - f(\lambda_2, z)|\,\phi_n(z)\,dz \\
&\leq |\theta_j|\,\epsilon \int \phi_n(z)\,dz = |\theta_j|\,\epsilon.
\end{aligned}
$$

Thus, as $|\theta_j|$ is fixed and finite, we have established equicontinuity. Therefore, Lemma 4.5 follows from Theorem 3.2. □

Now we are in a position to prove Theorem 4.4.

Proof of Theorem 4.4. For $\theta_j \neq 0$,



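$$(b) \leq \sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - \hat\theta_j(\lambda) Z_{nj}\right| + \sup_{\lambda\in\Lambda} \left|\hat\theta_j(\lambda) Z_{nj} - E\hat\theta_j(\lambda)\,\theta_j\right|. \qquad (9)$$

To bound the first term on the right hand side of (9), note that

$$\sup_{\lambda\in\Lambda} \left|\frac{1}{n}\sum_{i=1}^{n} \hat\theta_j^{(i)}(\lambda) Y_{ij} - \hat\theta_j(\lambda) Z_{nj}\right| \leq \frac{1}{n}\sum_{i=1}^{n} \sup_{\lambda\in\Lambda} \left|\hat\theta_j^{(i)}(\lambda) Y_{ij} - \hat\theta_j(\lambda) Y_{ij}\right| = \frac{1}{n}\sum_{i=1}^{n} |Y_{ij}| \sup_{\lambda\in\Lambda} \left|\hat\theta_j^{(i)}(\lambda) - \hat\theta_j(\lambda)\right|.$$

We bound $\sup_{\lambda\in\Lambda} |\hat\theta_j^{(i)}(\lambda) - \hat\theta_j(\lambda)|$ by

$$
\begin{aligned}
\sup_{\lambda\in\Lambda} \left|\hat\theta_j^{(i)}(\lambda) - \hat\theta_j(\lambda)\right|
&\leq \sup_{\lambda\in\Lambda} \left|\operatorname{sgn}(Z_{nj}^{(i)})(|Z_{nj}^{(i)}| - \lambda)_+ - \operatorname{sgn}(Z_{nj}^{(i)})(|Z_{nj}| - \lambda)_+\right| \\
&\quad + \sup_{\lambda\in\Lambda} \left|\operatorname{sgn}(Z_{nj}^{(i)})(|Z_{nj}| - \lambda)_+ - \operatorname{sgn}(Z_{nj})(|Z_{nj}| - \lambda)_+\right| \\
&= \sup_{\lambda\in\Lambda} \left|(|Z_{nj}^{(i)}| - \lambda)_+ - (|Z_{nj}| - \lambda)_+\right| + \left|\operatorname{sgn}(Z_{nj}^{(i)}) - \operatorname{sgn}(Z_{nj})\right| \sup_{\lambda\in\Lambda} \left|(|Z_{nj}| - \lambda)_+\right| \\
&\leq \left||Z_{nj}^{(i)}| - |Z_{nj}|\right| + \left|\operatorname{sgn}(Z_{nj}^{(i)}) - \operatorname{sgn}(Z_{nj})\right| |Z_{nj}| \xrightarrow{P} 0
\end{aligned}
$$

as $||Z_{nj}^{(i)}| - |Z_{nj}|| \xrightarrow{P} 0$ and $|\operatorname{sgn}(Z_{nj}^{(i)}) - \operatorname{sgn}(Z_{nj})| \xrightarrow{P} 0$ by virtue of $\theta_j \neq 0$. Then, $\frac{1}{n}\sum_{i=1}^{n} |Y_{ij}| \sup_{\lambda\in\Lambda} |\hat\theta_j^{(i)}(\lambda) - \hat\theta_j(\lambda)| = o_p(1)$. The second term in (9) converges in probability to 0 by Lemma 4.5. The result follows.

□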
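The key deterministic step in the proof above is that, for any two reals $a$ and $b$ and any $\lambda \geq 0$, the soft-thresholded values satisfy $|f(\lambda, a) - f(\lambda, b)| \leq ||a| - |b|| + |\operatorname{sgn}(a) - \operatorname{sgn}(b)|\,|b|$, uniformly in $\lambda$. The sketch below spot-checks this inequality on random draws. The sampling ranges and function names are assumptions made for the example, and the check is illustrative rather than a substitute for the argument.

```python
import random


def sgn(z):
    """Sign of z as an integer in {-1, 0, 1}."""
    return (z > 0) - (z < 0)


def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    return sgn(z) * max(abs(z) - lam, 0.0)


def bound_holds(trials=5000, seed=1):
    """Return True if the inequality holds on every random draw."""
    rng = random.Random(seed)
    tol = 1e-12  # slack for floating-point rounding
    for _ in range(trials):
        a, b = rng.uniform(-3, 3), rng.uniform(-3, 3)
        lam = rng.uniform(0, 4)
        lhs = abs(soft_threshold(a, lam) - soft_threshold(b, lam))
        rhs = abs(abs(a) - abs(b)) + abs(sgn(a) - sgn(b)) * abs(b)
        if lhs > rhs + tol:
            return False
    return True


if __name__ == "__main__":
    print(bound_holds())
```

The sign term only contributes when $a$ and $b$ fall on opposite sides of zero, which is the event that becomes asymptotically negligible when $\theta_j \neq 0$.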



This series of proofs shows that for the worst $\lambda \in \Lambda$, (a), (b), and (c) in the decomposition of (7) are each $o_p(1)$. Theorem 2.1 follows immediately.
5 Discussion and future work



A common practice in data analysis is to estimate the coefficients of a linear model via lasso 
and choose the regularization parameter via cross-validation. Unfortunately, no definitive 
theoretical results exist as to the effect of choosing the tuning parameter in this data- 
dependent way. 

In this paper, we provide a partial solution to the problem by demonstrating, under 
particular assumptions on the design matrix, that the lasso is risk consistent even when the 
tuning parameter is selected via leave-one-out cross-validation. 

We should first note that there is nothing particularly special about leave-one-out cross-validation. In fact, the proof of consistency we have given holds for $k$-fold cross-validation as well. The size of the left out set is only an issue in Lemma 4.2 and Theorem 4.4. In both cases, the only requirement for the result to hold is that $k \to \infty$ with $n$, i.e. the size of the left out set must be $o(n)$ (leave-one-out corresponds to $n$-fold cross-validation). So for instance $k = \log n$ will still lead to risk consistency.
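To make the $k$-fold variant concrete, here is a minimal sketch of the corresponding risk estimate for the normal means model. It assumes the training estimator on each fold is the soft-thresholded mean of the retained rows. Assigning row $i$ to fold $i \bmod k$ is a hypothetical choice made for this example, as are the function names; setting $k = n$ recovers the leave-one-out estimate.

```python
def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    magnitude = abs(z) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def kfold_cv_risk(Y, lam, k):
    """k-fold CV risk estimate; fold f holds out the rows i with i % k == f.

    Y is a list of n rows of length p. Requires 2 <= k <= n so that every
    fold has at least one held-out row and one training row.
    """
    n, p = len(Y), len(Y[0])
    total = 0.0
    for f in range(k):
        held_out = [i for i in range(n) if i % k == f]
        kept = [i for i in range(n) if i % k != f]
        for j in range(p):
            z_train = sum(Y[i][j] for i in kept) / len(kept)
            theta = soft_threshold(z_train, lam)
            total += sum((theta - Y[i][j]) ** 2 for i in held_out)
    return total / n


if __name__ == "__main__":
    Y = [[1.0], [3.0], [5.0], [7.0]]
    print(kfold_cv_risk(Y, 0.0, 2))  # two folds of size two
    print(kfold_cv_risk(Y, 0.0, 4))  # k = n, i.e. leave-one-out
```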

Assuming $X^\top X = n I_p$, i.e. orthogonal design, is a strong assumption. However, putting
conditions on the condition number of the design matrix is not new in the lasso literature. 
For example, a canonical assumption is that X has the restricted isometry property [17, 18]. 
These conditions ensure that the design matrices satisfy a mild generalisation of orthogonal 
matrices in which the columns are almost locally orthogonal rather than globally perfectly 
orthogonal. These conditions are necessary for the consistent estimation of the parameter 
vector, as they prevent disastrous collinearities. However, as we are interested solely in 
risk consistency rather than parameter consistency, strong conditions on the design matrix 
should not be necessary. We conjecture that risk consistency under cross-validation holds 
for arbitrary design matrices. 